CLIMATE 405: Machine Learning for Earth and Environmental Sciences; FALL 2024¶
Mohammed Ombadi (ombadi@umich.edu)¶
Lecture 6 (Monday, 09/16/2024)¶
Topics covered in this lecture:¶
- Clustering: Hierarchical Algorithms
- Clustering: Density-based Algorithms
- Examples of Clustering in Earth and Environmental Sciences
Import libraries¶
# Import libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy.integrate import odeint
from IPython.display import Image
from statistics import mode
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression
import statsmodels.api as sm
import pingouin as pg
# Suppress warnings
warnings.filterwarnings('ignore')
# Set number of decimals for np print options
np.set_printoptions(precision=3)
# Set the current working directory
os.chdir(sys.path[0])
Clustering: Hierarchical Algorithms ¶
Hierarchical algorithms can either be agglomerative (start from individual data points and successively merge them into clusters) or divisive (start from one large cluster and successively split it into smaller groups).
Agglomerative clustering (visual explanation):¶
display(Image(filename = 'Agglomerative clustering_animation.gif', width= 1000, height= 1000))
The clustering example above can be visualized as a dendrogram:
display(Image(filename = 'Dendrogram_agglomerative clustering.jpeg', width= 1000, height= 1000))
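A dendrogram like the one above can also be generated programmatically. The following is a minimal sketch using SciPy's `linkage` and `dendrogram` on a small synthetic dataset (two well-separated 2-D blobs, not the data from the animation):

```python
# Sketch: agglomerative clustering and its dendrogram with SciPy
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 10 points each
pts = np.vstack([rng.normal(0, 0.3, (10, 2)),
                 rng.normal(3, 0.3, (10, 2))])

# linkage() performs agglomerative clustering; each row of Z records one
# merge: (cluster index 1, cluster index 2, distance, new cluster size)
Z = linkage(pts, method='ward')

dendrogram(Z)
plt.title('Agglomerative clustering dendrogram (Ward linkage)')
plt.show()

# Cutting the tree at 2 clusters recovers the two blobs
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Cutting the dendrogram at different heights (via `fcluster`) yields different numbers of clusters without rerunning the algorithm, which is one practical advantage of hierarchical methods.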
Agglomerative clustering (sklearn implementation):¶
Scikit-learn documentation is here
Let's apply this on the zoo dataset:
df = pd.read_csv('zoo_data.csv')
df
| | animal_name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 1 | antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 2 | bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 4 |
| 3 | bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 4 | boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | wallaby | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | 1 |
| 97 | wasp | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 6 | 0 | 0 | 0 | 6 |
| 98 | wolf | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 99 | worm | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
| 100 | wren | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 2 |
101 rows × 18 columns
# Extract features and labels
X= df.iloc[:,1:-1]
y_true = df.iloc[:,-1]
# Perform agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters = 7).fit(X)
#print labels
clustering.labels_
array([1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 3, 2, 6, 0, 0, 3, 1, 2, 2, 3, 3,
1, 3, 0, 5, 5, 4, 1, 4, 0, 1, 4, 3, 2, 1, 1, 3, 2, 0, 0, 3, 0, 3,
1, 1, 0, 1, 1, 1, 1, 0, 5, 0, 1, 1, 3, 3, 3, 3, 2, 2, 6, 5, 1, 1,
2, 1, 1, 1, 1, 3, 0, 2, 2, 4, 6, 6, 3, 3, 6, 6, 2, 3, 4, 0, 2, 3,
0, 5, 5, 5, 2, 4, 1, 3, 4, 0, 1, 6, 3])
# Compare true labels with the ones from clustering
y_pred = clustering.labels_
result = pd.DataFrame(np.transpose(np.vstack((y_true, y_pred))), columns= ['y_true', 'y_label'])
result.iloc[:20,:]
| | y_true | y_label |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 4 | 2 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 4 | 2 |
| 8 | 4 | 2 |
| 9 | 1 | 1 |
| 10 | 1 | 1 |
| 11 | 2 | 3 |
| 12 | 4 | 2 |
| 13 | 7 | 6 |
| 14 | 7 | 0 |
| 15 | 7 | 0 |
| 16 | 2 | 3 |
| 17 | 1 | 1 |
| 18 | 4 | 2 |
| 19 | 1 | 2 |
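Note that the cluster IDs are arbitrary (e.g., true class 4 consistently maps to cluster 2 above), so a direct element-wise comparison of `y_true` and the clustering labels would be misleading. Agreement should instead be measured with permutation-invariant scores such as the adjusted Rand index (ARI) or normalized mutual information (NMI), both available in scikit-learn. A minimal sketch with hypothetical toy labels:

```python
# Permutation-invariant comparison of two labelings: identical partitions
# with swapped cluster IDs still score a perfect 1.0
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [1, 1, 1, 0, 0, 0]   # same partition, different IDs

print(adjusted_rand_score(y_true_demo, y_pred_demo))           # 1.0
print(normalized_mutual_info_score(y_true_demo, y_pred_demo))  # 1.0
```

The same functions can be applied directly to `y_true` and `clustering.labels_` from the zoo example to quantify how well the agglomerative clustering recovers the animal types.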
Clustering: Density-based Algorithms ¶
One example of density-based methods is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Here is a visual demonstration:
display(Image(filename = 'DB_scan_visual.gif', width= 1000, height= 1000))
Scikit-learn documentation can be accessed here
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import pdist, squareform
# Generate sample data
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
# Function to plot data
def plot_data(X, title, ax):
ax.scatter(X[:, 0], X[:, 1], s=10)
ax.set_title(title)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
# Plot initial data
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
plot_data(X, 'Initial Data', axs[0, 0])
# Compute the distance matrix
dist_matrix = squareform(pdist(X))
# Plot distance matrix
axs[0, 1].imshow(dist_matrix, cmap='hot', interpolation='nearest')
axs[0, 1].set_title('Distance Matrix')
axs[0, 1].set_xlabel('Point Index')
axs[0, 1].set_ylabel('Point Index')
# Apply DBSCAN
eps = 0.2
min_samples = 10
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X)
# Plot clustering result
axs[1, 0].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
axs[1, 0].set_title('DBSCAN Clustering Result')
axs[1, 0].set_xlabel('Feature 1')
axs[1, 0].set_ylabel('Feature 2')
# Highlight core samples
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True
axs[1, 1].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=10)
axs[1, 1].scatter(X[core_samples_mask, 0], X[core_samples_mask, 1], c='red', marker='o', s=50, edgecolor='k', label='Core Samples')
axs[1, 1].set_title('Core Samples Highlighted')
axs[1, 1].set_xlabel('Feature 1')
axs[1, 1].set_ylabel('Feature 2')
axs[1, 1].legend()
plt.tight_layout()
plt.show()
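A common heuristic for choosing `eps` is the k-distance plot: sort every point's distance to its k-th nearest neighbor (with k equal to `min_samples`) and look for the "knee" in the resulting curve. A sketch on the same two-moons data, using the `NearestNeighbors` class imported above:

```python
# k-distance plot heuristic for choosing DBSCAN's eps
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)

k = 10  # same as min_samples above
# +1 because kneighbors() on the training data includes the point itself
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nbrs.kneighbors(X)
k_dist = np.sort(dists[:, -1])   # distance to the k-th other point, sorted

plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()
```

The distance at which the curve bends sharply upward marks the transition from dense regions to sparse ones, and is a reasonable starting value for `eps`.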
Applications of clustering in Earth and Environmental Sciences ¶
1) Using clustering to determine optimum sampling locations¶
display(Image(filename = 'Van Arkel & Kaleita_WRR2014.png', width= 750, height= 750))
Research questions:¶
- Can K-means clustering be used to effectively identify critical sampling locations in an agricultural field?
- Can the clustering be successful based on input data that does not include soil moisture?
Research site and data used in the study:¶
display(Image(filename = 'Fig1_Van Arkel & Kaleita_WRR2014.jpg', width= 600, height= 600))
Figure 1: Brooks Field sampling grid with elevation (shading, in m) and soil types. Points are on 50 m intervals. Soil type indices, according to the National Cooperative Soil Survey: 55: Nicollet loam, 1−3% slopes; 95: Harps loam, 1−3% slopes; 107: Webster clay loam, 0−2% slopes; 138B: Clarion loam, 2−5% slopes; 138C2: Clarion loam, 5−9% slopes, moderately eroded; 507: Canisteo clay loam, 0−2% slopes
This is an agricultural site in Iowa with a size of 300 × 250 m. Soil moisture measurements were taken during the growing seasons from 2004 to 2008.
K-means clustering was applied with the following input data:¶
- K-means $M_{\theta}$: uses soil moisture data ($\theta$) for the year 2004.
- K-means $M_{T}$: uses topographic data.
- K-means $M_{E}$: uses electromagnetic induction data.
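The study's exact procedure is summarized in its tables; as a rough illustrative sketch only (synthetic data, hypothetical attribute names, not the Brooks Field inputs), one can cluster field grid cells on ancillary attributes and take the cell closest to each centroid as a candidate sampling site:

```python
# Illustrative sketch: pick representative sampling cells via K-means
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n_cells = 200
# Hypothetical standardized attributes per grid cell, e.g. [elevation, slope]
attrs = rng.normal(size=(n_cells, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(attrs)

# For each cluster, the member cell nearest its centroid is the
# representative (candidate sampling) site
sites = []
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(attrs[members] - km.cluster_centers_[c], axis=1)
    sites.append(members[np.argmin(d)])
print(sites)   # indices of the 4 candidate sampling cells
```

The idea is that a measurement at each representative cell approximates the behavior of all cells in its cluster, so a handful of sites can stand in for the whole field.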
display(Image(filename = 'Table1_Van Arkel & Kaleita_WRR2014.png', width= 1200, height= 1000))
Evaluation:¶
display(Image(filename = 'Table2_Van Arkel & Kaleita_WRR2014.png', width= 1200, height= 1000))
2) Using clustering to determine groundwater trends during and after the Australian Millennium Drought¶
display(Image(filename = 'Chen et al_ERL2024.png', width= 750, height= 750))
Key research questions:¶
1. What are the spatial patterns of groundwater levels during and after the drought?
Rainfall Anomalies during the drought (1997 - 2009)¶
display(Image(filename = 'Fig3_ERL2024.jpg', width= 750, height= 750))
Trends in Groundwater levels (clustered using K-means)¶
display(Image(filename = 'Fig4a_ERL2024.jpg', width= 1200, height= 900))
How did they determine the number of clusters (K = 4)?¶
They used the elbow method, which computes the within-cluster error (a.k.a. distortion) as a function of K and then identifies the "elbow" point, where increasing K begins to yield diminishing returns.
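The elbow method can be sketched as follows on synthetic data with 4 true clusters (mimicking the K = 4 choice in the study; the data here are generated with `make_blobs`, not the groundwater records):

```python
# Elbow method: within-cluster sum of squares (inertia_) vs. number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (distortion)')
plt.title('Elbow method')
plt.show()
```

The curve drops steeply up to the true number of clusters and flattens afterward; the bend ("elbow") suggests the value of K to use.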
display(Image(filename = 'Fig4b_ERL2024.jpg', width= 800, height= 800))